Corpus for Coreference Resolution on Scientific Papers
Authors
Abstract
The ever-growing number of published scientific papers creates a need for automatic knowledge extraction to help scientists keep up with the state of the art in their respective fields. Building a good knowledge extraction system requires annotated corpora in the scientific domain for training machine learning models. In this paper, we describe an annotated corpus for coreference resolution in multiple scientific domains, constructed from an existing corpus. We modified the annotation scheme of the Message Understanding Conference to better suit scientific texts and applied the modified scheme to the corpus. We then compared the annotated corpus with general-domain corpora in terms of the distribution of resolution classes and the performance of the Stanford Dcoref coreference resolver. These comparisons demonstrate quantitatively that our manually annotated corpus differs from a general-domain corpus, which suggests deep differences between general-domain texts and scientific texts and shows that different approaches can be taken to coreference resolution for the two kinds of text.
Similar resources
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Coreference resolution with deep learning in the Persian Language
Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...
A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
We describe a large coreference annotation task performed on a corpus of 266 papers from the ACL Anthology, a publicly, electronically available collection of scientific papers in the domain of computational linguistics and language technology. The annotation comprises mainly noun phrase coreference of the full textual content of each paper in the Anthology subset. It has been performed careful...
Coreference Resolution: A Survey
Coreference resolution is the task of resolving noun phrases to the entities that they refer to. Much work has been done in the past in this area and the related area of anaphora resolution. In this paper, we present a literature survey that is divided into two broad categories. Discussed first are papers that are linguistically motivated based on syntax, focus and Centering theory. We then dis...
Resolving Coreferent and Associative Noun Phrases in Scientific Text
We present a study of information status in scientific text as well as ongoing work on the resolution of coreferent and associative anaphora in two different scientific disciplines, namely computational linguistics and genetics. We present an annotated corpus of over 8000 definite descriptions in scientific articles. To adapt a state-of-the-art coreference resolver to the new domain, we develop...